I will cluster the deputies of the two biggest parties according to their votes, using two different distance measures and two different linkage methods, and then visualize the results.
library(dplyr)
library(tidyr)
library(cluster)
library(ape)
library(RColorBrewer)
load("/Users/Ashmitha/Downloads/Subs/R/all_votes.rda")
First, I filter the data to keep only the records for the two biggest parties, PO and PiS. The last column, which holds the voting topic, is not needed for clustering, so it is dropped.
# filtering based on club value
mydata = all_votes[all_votes$club %in% c('PO', 'PiS'),1:7]
Next, I extract the unique deputies by name and club, because the same name may appear under more than one club. This gives a list of 398 deputies in PiS and PO.
unique_deputies <- unique(mydata[c("surname_name", "club")])
Count the records for each voting by grouping on the id_voting variable.
voting_count = mydata %>%
  group_by(id_voting) %>%
  summarise(
    vote_count = n()
  )
To choose the votings used for clustering, I compare the number of records in each voting with the number of distinct deputies and keep only the votings with records for more than 75% of the deputies. The dataset is then filtered to those votings to get the final data for clustering.
# nrow() counts the deputies; length() on a data frame would return the number of columns
filtered_vote = voting_count[voting_count$vote_count > 0.75 * nrow(unique_deputies),]
# filter the dataset using the important votes "after filtration"
mydata <- mydata[mydata$id_voting %in% filtered_vote$id_voting,]
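The 75% threshold can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical deputies A to D), not the real `all_votes` data, and shows why the deputies are counted with `nrow()`.

```r
library(dplyr)

# Toy data (hypothetical): 4 deputies, 3 votings; voting 3 has only 2 records
toy <- data.frame(
  surname_name = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  club         = c("PO", "PO", "PiS", "PiS", "PO", "PO", "PiS", "PiS", "PO", "PO"),
  id_voting    = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
)

toy_deputies <- unique(toy[c("surname_name", "club")])
length(toy_deputies)  # number of columns in the data frame
nrow(toy_deputies)    # number of distinct deputies

# Count the records per voting, as in the main analysis
toy_counts <- toy %>%
  group_by(id_voting) %>%
  summarise(vote_count = n())

# Keep votings with records for more than 75% of the deputies
kept <- toy_counts[toy_counts$vote_count > 0.75 * nrow(toy_deputies), ]
kept$id_voting
```

With four deputies the threshold is 3 records, so voting 3, which has only two records, is dropped.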
Code the vote types as numbers: Against becomes -1, For becomes 0, and every other value becomes 1.
all_votes_new = mydata %>%
  mutate(
    voting_type = ifelse(vote == 'Against', -1, ifelse(vote == 'For', 0, 1))
  )
all_votes_new = all_votes_new[,c("surname_name","club", "id_voting","voting_type")]
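To see what this coding does, here is a minimal sketch on a hand-made vector. The values "Abstained" and "Not present" are assumptions for illustration; the actual values of `vote` in `all_votes` may differ.

```r
# Hypothetical vote values; the real values in all_votes may differ
vote <- c("For", "Against", "Abstained", "Not present", "For")

# Same nested ifelse() as in the analysis: Against = -1, For = 0, anything else = 1
voting_type <- ifelse(vote == "Against", -1, ifelse(vote == "For", 0, 1))

data.frame(vote, voting_type)
table(voting_type)
```

Note that every value other than For and Against ends up with the same code, so they cannot be told apart later.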
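Create the final dataset with the spread function, so that there is one row per deputy and one column per voting. Missing deputy and voting combinations are filled with 0, which is the same code as For.
final_data = all_votes_new %>%
  spread(id_voting, voting_type, fill=0)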
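The reshaping step can be tried on a tiny long table first. This is only a sketch with invented rows (`toy_long` is hypothetical); it shows how `spread()` turns voting ids into column names and how `fill = 0` handles a missing combination.

```r
library(tidyr)

# Hypothetical long-format data: deputy B has no record for voting 2
toy_long <- data.frame(
  surname_name = c("A", "A", "B"),
  club         = c("PO", "PO", "PiS"),
  id_voting    = c(1, 2, 1),
  voting_type  = c(-1, 1, 0)
)

# One row per deputy, one column per voting; the missing cell becomes 0
toy_wide <- spread(toy_long, id_voting, voting_type, fill = 0)
toy_wide
names(toy_wide)
```

The voting columns are named after the id values ("1" and "2" here), so they are best selected by character name.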
I calculate the distances between deputies in two ways, using the id_voting columns as coordinates: 1- Euclidean distances 2- Manhattan distances
# names of the voting columns kept after filtering (as character, so columns are selected by name, not by position)
distinct_id_voting = as.character(unique(mydata$id_voting))
rownames(final_data) = paste(final_data$club, final_data$surname_name, sep="_")
Euclidean distances for the original dataset
mat1 <- dist(final_data[,distinct_id_voting])
final_mat = as.matrix(mat1)
Manhattan distances for the original dataset
mat2 <- dist(final_data[,distinct_id_voting], method = 'manhattan')
final_mat2 = as.matrix(mat2)
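The difference between the two distances is easy to verify by hand. The sketch below uses two invented deputies (`dep1`, `dep2`) who vote in opposite ways on two of three votings.

```r
# Two hypothetical deputies, three votings, coded -1 / 0 / 1
votes <- rbind(
  dep1 = c(-1, 0,  1),
  dep2 = c( 1, 0, -1)
)

d_euc <- dist(votes)                        # sqrt(2^2 + 0^2 + 2^2) = sqrt(8)
d_man <- dist(votes, method = "manhattan")  # |2| + |0| + |2| = 4

d_euc
d_man
```

Manhattan distance grows linearly with the number of votings on which two deputies differ, while Euclidean distance grows with the square root of the summed squared differences.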
I cluster the data with two different linkage methods: 1- average, applied to the Euclidean distances 2- complete, applied to the Manhattan distances
final_data$colr <- factor(ifelse(final_data$club == 'PO', 1, 2))
final_data2 <-final_data
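# Using average linkage on the Euclidean distances
# (a dist object is passed, so agnes treats it as dissimilarities rather than as raw observations)
hc <- agnes(mat1, method="average")
plot(hc, which.plots=2, cex=0.5, main="")
final_data$labels = factor(cutree(hc, k=4))
# Using complete linkage on the Manhattan distances
hc2 <- agnes(mat2, method="complete")
plot(hc2, which.plots=2, cex=0.5, main="")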
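How the two linkage methods differ can be seen on three points on a line. This is a minimal sketch with made-up coordinates (0, 1 and 5), passing a `dist` object to `agnes()` so that it is treated as dissimilarities.

```r
library(cluster)

# Three hypothetical points on a line: pairwise distances are 1, 4 and 5
x <- matrix(c(0, 1, 5), ncol = 1, dimnames = list(c("a", "b", "c"), NULL))
d <- dist(x)

# Merge heights in merge order, via the hclust representation
h_avg <- as.hclust(agnes(d, method = "average"))$height
h_cmp <- as.hclust(agnes(d, method = "complete"))$height

h_avg  # a and b merge at 1, then join c at the mean of 4 and 5
h_cmp  # a and b merge at 1, then join c at the maximum of 4 and 5
```

Average linkage uses the mean distance between the members of two clusters, while complete linkage uses the largest one, so complete linkage tends to produce taller final merges.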
All deputies are clustered, and each row is labeled with the deputy's club and name. PO is drawn in green and PiS in orange.
#Clusters Using Euclidean Distance And Average Linkage
cols <- brewer.pal(3,"Set2")
# reuse the average-linkage clustering instead of re-clustering the raw data frame
tree <- as.phylo(as.hclust(hc))
par(mar=c(1,1,2,1), xpd=NA)
plot(tree, type = "fan", cex = 0.8,
     tip.color = cols[final_data$colr])
plot(tree, type = "unrooted", cex = 0.8,
     tip.color = cols[final_data$colr])
plot(tree, type = "radial", cex = 0.8,
     tip.color = cols[final_data$colr])
plot(tree, type = "cladogram", cex = 0.8,
     tip.color = cols[final_data$colr])
final_data2$labels = factor(cutree(hc2, k=4))
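One simple way to inspect the cluster labels is to cross-tabulate them with the club. The sketch below uses invented labels (the `club` and `labels` vectors are hypothetical, not results from this analysis); on the real data the same idea would be `table(final_data$club, final_data$labels)`.

```r
# Hypothetical clubs and cluster labels for six deputies
club   <- c("PO", "PO", "PO", "PiS", "PiS", "PiS")
labels <- factor(c(1, 1, 2, 3, 3, 3))  # stands in for cutree() output

# Rows are clubs, columns are clusters
tab <- table(club, labels)
tab
```

A table in which each cluster contains deputies from only one club would suggest that the clustering separates the parties.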
Because of the large number of deputies, the plots are hard to read; filtering the deputies further would give clearer results.
To summarise: the analysis covers only the two largest parties (PO and PiS) and only the votings with records for more than 75% of their deputies. The dendrograms are drawn for the selected deputies, with green for PO and orange for PiS to distinguish the parties.